Problem Statement¶
Objective¶
AllLife Bank aims to expand its base of personal loan customers by converting liability customers (depositors) into borrowers. A previous campaign achieved a conversion rate of over 9%, indicating potential growth opportunities in this area. The current task is to develop a predictive model to identify customer attributes that significantly influence loan purchases and to determine segments of customers with a higher probability of opting for personal loans. This model will assist in targeted marketing efforts, enhancing the effectiveness of future campaigns.
Data Dictionary¶
- ID: Customer ID
- Age: Customer's age in completed years
- Experience: Number of years of professional experience
- Income: Annual income of the customer (in thousand dollars)
- ZIP Code: Home address ZIP code
- Family: Family size of the customer
- CCAvg: Average spending on credit cards per month (in thousand dollars)
- Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
- Mortgage: Value of house mortgage, if any (in thousand dollars)
- Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
- Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
- CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
- Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
- CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)
Analysis Conclusions¶
- The post-pruned decision tree exhibits good generalization, performing similarly on both training and test data in terms of recall, precision, and F1-score.
- The pre-pruned tree achieved a perfect recall of 1 on the test set, but its other metrics (precision and F1) were poor, indicating it might be flagging too many non-loan customers as potential loan buyers (high false positives).
- The pre-pruned tree only considered Income, CCAvg, and Family as important features. In contrast, the post-pruned tree is more comprehensive, including Income, Education_2, CCAvg, Education_3, Family, and Age, which is likely to lead to more robust predictions.
- We will select the post-pruned model as the best for this problem for the following reasons:
  - It achieves a good recall score of 0.94 on the test set, which is crucial for minimizing false negatives (missing potential loan customers). It also maintains reasonable precision and F1 scores, unlike the pre-pruned tree.
  - Tree depth is not itself a performance metric, but the post-pruned tree, which considers more features and is slightly deeper than some pre-pruned iterations, performed better overall, most notably in test precision compared to the pre-pruned model.
Business Recommendations¶
- The bank's marketing team can deploy this model to identify which liability customers have a higher potential to purchase a loan.
- Using the model's likelihood score, the bank can prioritize its marketing targets.
- Income and education are the most important features in the model's decisions.
- Credit card spending and family size also play a role.
- Marketing strategies can be tailored toward customers who have higher income, are more educated, spend heavily on their credit cards, and have larger families.
- The model is built to reduce false negatives so that the marketing team does not miss potential customers. At the same time, it maintains good precision, which keeps false positives down.
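As a rough illustration of how a likelihood score could drive targeting, the sketch below ranks customers by the predicted probability of class 1 from a fitted tree. The data is synthetic and the names `loan_score`, `TOP_N`, and `top_targets` are hypothetical; this is not the notebook's final model, only one way the idea might look in code.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the bank's dataset): loan acceptance is made
# to depend loosely on income so that the tree has something to learn.
rng = np.random.default_rng(1)
income = rng.integers(8, 225, size=500)
X = pd.DataFrame({"Income": income, "Family": rng.integers(1, 5, size=500)})
y = (income + rng.normal(scale=30, size=500) > 150).astype(int)

# A shallow tree, so leaves can hold mixed classes and scores are not only 0 or 1.
model = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Column 1 of predict_proba is the estimated probability of class 1 (accepts loan).
scores = pd.Series(model.predict_proba(X)[:, 1], index=X.index, name="loan_score")

# Hypothetical campaign size: contact the TOP_N highest-scoring customers first.
TOP_N = 50
top_targets = scores.sort_values(ascending=False).head(TOP_N)
print(top_targets.head())
```

In practice the cutoff would depend on campaign budget and on how the bank weighs missed customers against wasted contacts.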
1 - Import necessary libraries¶
We'll use a decision tree, a classification algorithm, to predict the categorical variable Personal_Loan.
Instruction: Restart the runtime after installing libraries to ensure correct package versions and ignore dependency warnings.
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user
# Data manipulation libraries
import pandas as pd
import numpy as np
# Data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Library to split data
from sklearn.model_selection import train_test_split
# To build classification model for prediction
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To compute various classification metrics
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
)
# to suppress unnecessary warnings
import warnings
warnings.filterwarnings("ignore")
2 - Load the dataset¶
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
Loan = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/Project-2/loan_modelling.csv")
# copy data
data = Loan.copy()
3 - Overview Data¶
3.1 View sample rows¶
data.head()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
data.tail()
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.4 | 1 | 85 | 0 | 0 | 0 | 1 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.5 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.8 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
data.sample(5)
| | ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 469 | 470 | 48 | 23 | 10 | 94609 | 2 | 0.7 | 3 | 0 | 0 | 0 | 0 | 1 | 1 |
| 276 | 277 | 30 | 5 | 22 | 90058 | 4 | 0.5 | 3 | 109 | 0 | 0 | 0 | 1 | 0 |
| 4965 | 4966 | 29 | 5 | 33 | 94709 | 1 | 1.8 | 2 | 78 | 0 | 0 | 0 | 1 | 0 |
| 3059 | 3060 | 61 | 36 | 128 | 94550 | 1 | 2.6 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4440 | 4441 | 43 | 19 | 75 | 90041 | 3 | 0.3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
3.2 Data Shape¶
data.shape
(5000, 14)
- The dataset has 5000 rows and 14 columns.
3.3 Data types¶
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Personal_Loan       5000 non-null   int64
 10  Securities_Account  5000 non-null   int64
 11  CD_Account          5000 non-null   int64
 12  Online              5000 non-null   int64
 13  CreditCard          5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
- Numerical variables: ID, Age, Experience, Income, Family, CCAvg, Mortgage.
- Categorical variables: ZIPCode, Education, Personal_Loan, Securities_Account, CD_Account, Online, and CreditCard are stored as integers, but they are categorical variables that are already numerically encoded.
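One way to spot integer columns that are really categorical is to count their distinct values. The sketch below does this on a small toy frame; the frame, the `MAX_LEVELS` cutoff, and the variable names are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

# Toy frame standing in for the loan data (values are illustrative only).
toy = pd.DataFrame(
    {
        "Income": [49, 34, 11, 100, 45, 29],
        "Education": [1, 1, 1, 2, 2, 3],
        "Online": [0, 0, 0, 0, 0, 1],
    }
)

# Hypothetical cutoff: columns with at most this many distinct values are
# treated as candidates for a categorical dtype.
MAX_LEVELS = 3

n_levels = toy.nunique()
categorical_candidates = n_levels[n_levels <= MAX_LEVELS].index.tolist()
print(categorical_candidates)
```

A cardinality cutoff like this would not flag ZIPCode, which has many distinct values; columns of that kind still need domain judgment.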
3.4 Statistical Summary¶
data.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.0 | 2500.500000 | 1443.520003 | 1.0 | 1250.75 | 2500.5 | 3750.25 | 5000.0 |
| Age | 5000.0 | 45.338400 | 11.463166 | 23.0 | 35.00 | 45.0 | 55.00 | 67.0 |
| Experience | 5000.0 | 20.104600 | 11.467954 | -3.0 | 10.00 | 20.0 | 30.00 | 43.0 |
| Income | 5000.0 | 73.774200 | 46.033729 | 8.0 | 39.00 | 64.0 | 98.00 | 224.0 |
| ZIPCode | 5000.0 | 93169.257000 | 1759.455086 | 90005.0 | 91911.00 | 93437.0 | 94608.00 | 96651.0 |
| Family | 5000.0 | 2.396400 | 1.147663 | 1.0 | 1.00 | 2.0 | 3.00 | 4.0 |
| CCAvg | 5000.0 | 1.937938 | 1.747659 | 0.0 | 0.70 | 1.5 | 2.50 | 10.0 |
| Education | 5000.0 | 1.881000 | 0.839869 | 1.0 | 1.00 | 2.0 | 3.00 | 3.0 |
| Mortgage | 5000.0 | 56.498800 | 101.713802 | 0.0 | 0.00 | 0.0 | 101.00 | 635.0 |
| Personal_Loan | 5000.0 | 0.096000 | 0.294621 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Securities_Account | 5000.0 | 0.104400 | 0.305809 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| CD_Account | 5000.0 | 0.060400 | 0.238250 | 0.0 | 0.00 | 0.0 | 0.00 | 1.0 |
| Online | 5000.0 | 0.596800 | 0.490589 | 0.0 | 0.00 | 1.0 | 1.00 | 1.0 |
| CreditCard | 5000.0 | 0.294000 | 0.455637 | 0.0 | 0.00 | 0.0 | 1.00 | 1.0 |
- The average age of customers is approximately 45 years, with ages ranging from 23 to 67 years.
- The average professional experience of customers is 20 years. The value of -3 for experience is an anomaly and does not make sense in this context.
- The average income of customers is 73K, and the values may be slightly right-skewed.
- At least 75% of the customers have 3 or fewer people in the family.
- Average monthly credit card spending is about $1.94K, with some customers spending as much as $10K, far above the bulk of the data. This suggests the presence of outliers.
- At least 50% of the customers have an education level of Graduate or lower.
- About 69% of the customers have no mortgage; among the rest, mortgage values run as high as $635K.
- The majority of customers did not take a personal loan in the last campaign, and most have neither a securities account nor a certificate of deposit account.
- Approximately 59.68% of the customers use internet banking facilities.
- Approximately 29.40% of the customers use a credit card issued by another bank.
# Count the number of rows where 'Mortgage' is 0
data.loc[data['Mortgage'] == 0]['Mortgage'].value_counts()
| Mortgage | |
|---|---|
| 0 | 3462 |
data.loc[data['Experience'] == -3]['Experience'].value_counts()
| Experience | |
|---|---|
| -3 | 4 |
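The check above counts only the value -3. A slightly broader check, sketched here on a toy Series with made-up values, counts every negative entry at once so that other negative codes are not missed.

```python
import pandas as pd

# Toy Experience column (illustrative values, not the real data).
experience = pd.Series([1, 19, -1, 15, -2, -3, 9, -1], name="Experience")

# Frequency of each negative value, sorted by the value itself.
negative_counts = experience[experience < 0].value_counts().sort_index()
print(negative_counts)

# Total number of rows with a negative value.
n_negative = int((experience < 0).sum())
print(n_negative)
```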
data['Personal_Loan'].value_counts()
| Personal_Loan | |
|---|---|
| 0 | 4520 |
| 1 | 480 |
data['Securities_Account'].value_counts()
| Securities_Account | |
|---|---|
| 0 | 4478 |
| 1 | 522 |
data['CD_Account'].value_counts()
| CD_Account | |
|---|---|
| 0 | 4698 |
| 1 | 302 |
3.5 Check duplicates and missing values¶
data.duplicated().sum()
0
- There are no duplicate entries in the data.
data.isna().sum()
| | 0 |
|---|---|
| ID | 0 |
| Age | 0 |
| Experience | 0 |
| Income | 0 |
| ZIPCode | 0 |
| Family | 0 |
| CCAvg | 0 |
| Education | 0 |
| Mortgage | 0 |
| Personal_Loan | 0 |
| Securities_Account | 0 |
| CD_Account | 0 |
| Online | 0 |
| CreditCard | 0 |
- There are no missing values in the dataset.
3.6 Dropping columns¶
data = data.drop(['ID'], axis=1)
4 - Data Preprocessing - stage 1¶
4.1 Treat Anomalous Values in the Experience column¶
data["Experience"].unique()
array([ 1, 19, 15, 9, 8, 13, 27, 24, 10, 39, 5, 23, 32, 41, 30, 14, 18,
21, 28, 31, 11, 16, 20, 35, 6, 25, 7, 12, 26, 37, 17, 2, 36, 29,
3, 22, -1, 34, 0, 38, 40, 33, 4, -2, 42, -3, 43])
- Values of -1, -2 and -3 are anomalies
# check for experience < 0
data.loc[data["Experience"] < 0]["Experience"].unique()
array([-1, -2, -3])
# Correcting the experience values, assuming each negative value is a sign error
data["Experience"] = data["Experience"].replace({-1: 1, -2: 2, -3: 3})
data["Education"].unique()
array([1, 2, 3])
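The correction above treats each negative Experience value as a sign error. Under that same assumption, taking the absolute value gives an identical result and also covers any negative value not listed explicitly. A minimal sketch on a toy Series (illustrative values only):

```python
import pandas as pd

# Toy Experience column (illustrative values, not the real data).
experience = pd.Series([1, 19, -1, 15, -2, -3, 9], name="Experience")

# Explicit mapping: flip the sign of the three known negative values.
mapped = experience.replace({-1: 1, -2: 2, -3: 3})

# Shortcut under the same assumption that negatives are sign errors.
absolute = experience.abs()

print(mapped.equals(absolute))
print(absolute.tolist())
```

If the negatives were instead thought to be missing-value codes, imputation (for example from Age) would be a different and equally defensible choice.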
4.2 Feature Engineering¶
# check the number of unique values in the zip code
data["ZIPCode"].nunique()
467
# Converts the data type of the "ZIPCode" to a string.
data["ZIPCode"] = data["ZIPCode"].astype(str)
print(
    "Number of unique values if we take first two digits of ZIPCode: ",
    data["ZIPCode"].str[0:2].nunique(),
)
Number of unique values if we take first two digits of ZIPCode: 7
data["ZIPCode"] = data["ZIPCode"].str[0:2]
data["ZIPCode"].unique()
array(['91', '90', '94', '92', '93', '95', '96'], dtype=object)
data["ZIPCode"] = data["ZIPCode"].astype("category")
data["ZIPCode"].info()
<class 'pandas.core.series.Series'>
RangeIndex: 5000 entries, 0 to 4999
Series name: ZIPCode
Non-Null Count  Dtype
--------------  -----
5000 non-null   category
dtypes: category(1)
memory usage: 5.4 KB
# Convert the data type of categorical features to 'category'
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
    "ZIPCode",
]
data[cat_cols] = data[cat_cols].astype("category")
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   int64
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(1), int64(5)
memory usage: 269.8 KB
5 Exploratory Data Analysis (EDA)¶
5.1 Univariate Analysis¶
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    # creating the 2 subplots
    # boxplot will be created and a marker will indicate the mean value of the column
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )
    # For histogram
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )
    # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )
    # Add median to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="orange", linestyle="-"
    )
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its label
    plt.show()  # show the plot
5.1.1 Observations on Age¶
histogram_boxplot(data, "Age")
- The average age of customers is approximately 45 years.
- The age distribution appears symmetric.
- There are no apparent outliers in the age data.
5.1.2 Observations on Experience¶
histogram_boxplot(data, "Experience")
- The average experience of customers is 20 years
- There are no outliers
5.1.3 Observations on Income¶
histogram_boxplot(data, "Income")
- The average customer income is $73K
- The income distribution is right-skewed, indicating the presence of outliers.
5.1.4 Observations on CCAvg¶
histogram_boxplot(data, "CCAvg")
- The value distribution is right-skewed
- There are outliers
5.1.5 Observations on Mortgage¶
histogram_boxplot(data, "Mortgage")
- The mortgage value is heavily right-skewed with a high frequency of zero or very low values.
- There are many outliers, indicated by the individual points plotted to the right of the right whisker, suggesting significantly higher mortgage values for some customers.
- The long right whisker indicates a wide spread of values in the upper 25% of the data.
5.1.6 Observations on Family¶
labeled_barplot(data, "Family", perc=True)
- At least 75% of the customers have 3 or fewer people in the family.
5.1.7 Observations on Education¶
labeled_barplot(data,"Education", perc=True)
- About 42% of customers have an undergraduate education level.
5.1.8 Observations on Securities_Account¶
labeled_barplot(data, "Securities_Account", perc=True)
- Nearly 90% of customers do not have securities account.
5.1.9 Observations on CD_Account¶
labeled_barplot(data, "CD_Account", perc=True)
- About 94% of customers do not have a certificate of deposit account.
5.1.10 Observations on Online¶
labeled_barplot(data, "Online", perc=True)
- ~60% of customers use internet banking
5.1.11 Observation on CreditCard¶
labeled_barplot(data, "CreditCard", perc=True)
- Approximately 30% of the customers use a credit card issued by another bank.
5.1.12 Observation on ZIPCode¶
labeled_barplot(data, "ZIPCode", perc=True)
- ~30% of customers live in the Zipcode starting with "94"
- Very few customers (under 1%) have a ZIP code starting with "96".
5.1.13 Just get the count on Personal_Loan¶
labeled_barplot(data, "Personal_Loan", perc=True)
5.2 Bivariate Analysis¶
def stacked_barplot(data, predictor, target):
    """
    Print the category counts and plot a stacked bar chart
    data: dataframe
    predictor: independent variable
    target: target variable
    """
    count = data[predictor].nunique()
    sorter = data[target].value_counts().index[-1]
    tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
        by=sorter, ascending=False
    )
    print(tab1)
    print("-" * 120)
    tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
        by=sorter, ascending=False
    )
    print(tab)
    print("-" * 120)
    tab.plot(kind="bar", stacked=True, figsize=(count + 5, 5))
    plt.legend(loc="upper left", bbox_to_anchor=(1, 1), title=target)
    plt.show()
### function to plot distributions wrt target
def distribution_plot_wrt_target(data, predictor, target):
    fig, axs = plt.subplots(2, 2, figsize=(12, 10))
    target_uniq = data[target].unique()
    axs[0, 0].set_title(f"Distribution of {predictor} for target={str(target_uniq[0])} (not opted for {target})")
    sns.histplot(
        data=data[data[target] == target_uniq[0]],
        x=predictor,
        kde=True,
        ax=axs[0, 0],
        color="teal",
        stat="density",
    )
    axs[0, 1].set_title(f"Distribution of {predictor} for target={str(target_uniq[1])} (opted for {target})")
    sns.histplot(
        data=data[data[target] == target_uniq[1]],
        x=predictor,
        kde=True,
        ax=axs[0, 1],
        color="orange",
        stat="density",
    )
    axs[1, 0].set_title(f"Boxplot of {predictor} w.r.t {target} with outliers")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 0],
        palette="gist_rainbow"
    )
    axs[1, 1].set_title(f"Boxplot of {predictor} w.r.t {target} without outliers")
    sns.boxplot(
        data=data,
        x=target,
        y=predictor,
        ax=axs[1, 1],
        showfliers=False,
        palette="gist_rainbow",
    )
    plt.tight_layout()
    plt.show()
5.2.1 Correlation check¶
plt.figure(figsize=(15, 7))
sns.heatmap(data.corr(numeric_only=True), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
- Age and Experience are heavily correlated: the higher the age, the greater the experience.
- The correlation between Income and CCAvg is positive.
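To read a single pairwise correlation without the full heatmap, `Series.corr` can be called directly. The sketch below uses only the five Age and Experience values shown in the `data.head()` output earlier, so the number it prints illustrates the calculation and is not the correlation for the full dataset.

```python
import pandas as pd

# First five rows of Age and Experience, copied from the data.head() output.
sample = pd.DataFrame(
    {"Age": [25, 45, 39, 35, 35], "Experience": [1, 19, 15, 9, 8]}
)

# Series.corr computes the Pearson correlation coefficient by default.
r = sample["Age"].corr(sample["Experience"])
print(round(r, 3))
```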
# scatter plot matrix (pairplot creates its own figure, so a separate plt.figure call only produces an empty figure)
sns.pairplot(data, hue="Personal_Loan", diag_kind="kde");
- The pair plot confirms the correlation between Age and Experience.
- Customers with higher income and higher credit card spending seem to have accepted personal loan in the last campaign.
5.2.2 Personal_Loan vs Education¶
stacked_barplot(data, "Education", "Personal_Loan")
Personal_Loan     0    1   All
Education
All            4520  480  5000
3              1296  205  1501
2              1221  182  1403
1              2003   93  2096
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
Education
3              0.863424  0.136576
2              0.870278  0.129722
1              0.955630  0.044370
------------------------------------------------------------------------------------------------------------------------
- Customers with a higher education level (graduate and above) are more likely to opt for a loan.
5.2.3 Personal_Loan vs Family¶
stacked_barplot(data, "Family", "Personal_Loan")
Personal_Loan     0    1   All
Family
All            4520  480  5000
4              1088  134  1222
3               877  133  1010
1              1365  107  1472
2              1190  106  1296
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
Family
3              0.868317  0.131683
4              0.890344  0.109656
2              0.918210  0.081790
1              0.927310  0.072690
------------------------------------------------------------------------------------------------------------------------
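- Larger families are more likely to opt for loans: families of 3 or 4 accepted at about 13% and 11%, versus about 7% to 8% for families of 1 or 2.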
5.2.4 Personal_Loan vs Securities_Account¶
stacked_barplot(data, "Securities_Account", "Personal_Loan")
Personal_Loan          0    1   All
Securities_Account
All                 4520  480  5000
0                   4058  420  4478
1                    462   60   522
------------------------------------------------------------------------------------------------------------------------
Personal_Loan              0         1
Securities_Account
1                   0.885057  0.114943
0                   0.906208  0.093792
------------------------------------------------------------------------------------------------------------------------
- Having a securities account does not seem to have a strong influence on loan purchases.
5.2.5 Personal_Loan vs CD_Account¶
stacked_barplot(data, "CD_Account", "Personal_Loan")
Personal_Loan     0    1   All
CD_Account
All            4520  480  5000
0              4358  340  4698
1               162  140   302
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
CD_Account
1              0.536424  0.463576
0              0.927629  0.072371
------------------------------------------------------------------------------------------------------------------------
- A significantly higher percentage (46.36%) of customers with a Certificate of Deposit (CD) account purchased a personal loan in the last campaign compared to those without a CD account (7.24%).
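The size of this gap can be made concrete with a little arithmetic on the crosstab counts. The sketch below recomputes the two acceptance rates, their ratio, and the share of all loan acceptors who hold a CD account; the variable names are illustrative.

```python
# Counts taken from the CD_Account crosstab printed above.
cd_yes_loan, cd_yes_total = 140, 302
cd_no_loan, cd_no_total = 340, 4698
all_loans = 480

rate_cd = cd_yes_loan / cd_yes_total  # acceptance rate with a CD account
rate_no_cd = cd_no_loan / cd_no_total  # acceptance rate without one
ratio = rate_cd / rate_no_cd  # how many times higher the first rate is
share_of_acceptors = cd_yes_loan / all_loans  # acceptors who hold a CD account

print(f"{rate_cd:.2%} vs {rate_no_cd:.2%}, ratio {ratio:.1f}x")
print(f"CD holders among all acceptors: {share_of_acceptors:.1%}")
```

CD holders accept at a much higher rate, but they are a small group, so they still account for a minority of all acceptors.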
5.2.6 Personal_Loan vs Online¶
stacked_barplot(data, "Online", "Personal_Loan")
Personal_Loan     0    1   All
Online
All            4520  480  5000
1              2693  291  2984
0              1827  189  2016
------------------------------------------------------------------------------------------------------------------------
Personal_Loan        0        1
Online
1              0.90248  0.09752
0              0.90625  0.09375
------------------------------------------------------------------------------------------------------------------------
- Use of internet banking does not seem to influence loan purchases.
5.2.7 Personal_Loan vs CreditCard¶
stacked_barplot(data, "CreditCard", "Personal_Loan")
Personal_Loan     0    1   All
CreditCard
All            4520  480  5000
0              3193  337  3530
1              1327  143  1470
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
CreditCard
1              0.902721  0.097279
0              0.904533  0.095467
------------------------------------------------------------------------------------------------------------------------
- The use of a credit card from another bank does not appear to significantly influence whether a customer purchases a personal loan
5.2.8 Personal_Loan vs ZIPCode¶
stacked_barplot(data, "ZIPCode", "Personal_Loan")
Personal_Loan     0    1   All
ZIPCode
All            4520  480  5000
94             1334  138  1472
92              894   94   988
95              735   80   815
90              636   67   703
91              510   55   565
93              374   43   417
96               37    3    40
------------------------------------------------------------------------------------------------------------------------
Personal_Loan         0         1
ZIPCode
93             0.896882  0.103118
95             0.901840  0.098160
91             0.902655  0.097345
90             0.904694  0.095306
92             0.904858  0.095142
94             0.906250  0.093750
96             0.925000  0.075000
------------------------------------------------------------------------------------------------------------------------
- ZIP code does not seem to influence loan purchases.
5.2.9 Personal_Loan vs Age¶
distribution_plot_wrt_target(data, "Age", "Personal_Loan")
- Based on the distribution plots, age does not appear to significantly influence personal loan purchases.
5.2.10 Personal Loan vs Experience¶
distribution_plot_wrt_target(data, "Experience", "Personal_Loan")
- Based on the distribution plots, the distributions of Age and Experience look quite similar.
5.2.11 Personal Loan vs Income¶
distribution_plot_wrt_target(data, "Income", "Personal_Loan")
- Based on the distribution plots, higher income seems to be associated with a higher likelihood of opting for a personal loan.
5.2.12 Personal Loan vs CCAvg¶
distribution_plot_wrt_target(data, "CCAvg", "Personal_Loan")
- Based on the distribution plots, higher credit card spending seems to be associated with a higher likelihood of opting for a personal loan.
5.2.13 Personal Loan vs Mortgage¶
distribution_plot_wrt_target(data, "Mortgage", "Personal_Loan")
- Mortgage value does not seem to influence personal loan purchases significantly.
5.3 EDA observations¶
- Based on the exploratory data analysis, the features that may influence personal loan purchases are Income, CCAvg, Education, Family, and CD_Account.
- Let's proceed to build a decision tree and evaluate further.
6 Data Preprocessing - stage 2¶
6.1 Outlier Detection¶
Let's find the percentage of outliers, in each column of the data, using IQR
# To find the 25th percentile
Q1 = data.select_dtypes(include=["float64", "int64"]).quantile(0.25)
# To find the 75th percentile
Q3 = data.select_dtypes(include=["float64", "int64"]).quantile(0.75)
print("type(Q1) = ", type(Q1))
print(Q1)
type(Q1) =  <class 'pandas.core.series.Series'>
Age           35.0
Experience    10.0
Income        39.0
Family         1.0
CCAvg          0.7
Mortgage       0.0
Name: 0.25, dtype: float64
# Compute the interquartile range (75th percentile - 25th percentile)
IQR = Q3 - Q1
# Finding lower and upper bounds for all numerical features. All values outside these bounds are outliers
lower_whisker = (Q1 - 1.5 * IQR)
upper_whisker = (Q3 + 1.5 * IQR)
# Calculate the percentage of outliers for each numerical column
(
(data.select_dtypes(include=["float64", "int64"]) < lower_whisker)
| (data.select_dtypes(include=["float64", "int64"]) > upper_whisker)
).sum() / len(data) * 100
| | 0 |
|---|---|
| Age | 0.00 |
| Experience | 0.00 |
| Income | 1.92 |
| Family | 0.00 |
| CCAvg | 6.48 |
| Mortgage | 5.82 |
- Income has 1.92% outliers, CCAvg has 6.48%, and Mortgage has 5.82%.
- Based on EDA, Income and CCAvg appear to influence loan purchase, while Mortgage does not.
- Since the percentage of outliers in Income and CCAvg is relatively small, and their distributions are continuous according to the box plots in sections 5.1.3 and 5.1.4, these entries will be retained for model building.
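The IQR steps in this section can be wrapped into one reusable helper. The following is a minimal sketch, assuming the same 1.5 × IQR rule; the function name `iqr_outlier_share` and the toy frame are hypothetical and only demonstrate the calculation.

```python
import pandas as pd


def iqr_outlier_share(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Percentage of values per numeric column outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series index on the columns.
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.mean() * 100


# Toy data (illustrative only): one extreme Income value, no extreme Family value.
toy = pd.DataFrame(
    {
        "Income": [40, 45, 50, 55, 60, 65, 70, 75, 80, 400],
        "Family": [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
    }
)
shares = iqr_outlier_share(toy)
print(shares)
```

Decision trees split on thresholds rather than distances, so values flagged this way usually do not need to be removed or capped before fitting a tree.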
6.2 Data Preparation for Modeling¶
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 5000 non-null   int64
 1   Experience          5000 non-null   int64
 2   Income              5000 non-null   int64
 3   ZIPCode             5000 non-null   category
 4   Family              5000 non-null   int64
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   category
 7   Mortgage            5000 non-null   int64
 8   Personal_Loan       5000 non-null   category
 9   Securities_Account  5000 non-null   category
 10  CD_Account          5000 non-null   category
 11  Online              5000 non-null   category
 12  CreditCard          5000 non-null   category
dtypes: category(7), float64(1), int64(5)
memory usage: 269.8 KB
# Drop Experience column as it is very highly correlated with Age (see section 5.2.1)
X = data.drop(["Personal_Loan", "Experience"], axis=1)
Y = data["Personal_Loan"]
# ZIPCode and Education are one hot encoded.
# Other categorical features do not need one-hot encoding as they have 0 or 1 values.
X = pd.get_dummies(X, columns=["ZIPCode", "Education"], drop_first=True)
X = X.astype(float)
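To see what `drop_first=True` does, the toy example below encodes a three-level Education column. The first level (1, Undergrad) becomes the implicit baseline, so a row with zeros in both dummy columns represents Education = 1. The toy values are illustrative.

```python
import pandas as pd

# Toy frame with one numeric and one three-level categorical column.
toy = pd.DataFrame({"Income": [49, 100, 40], "Education": [1, 2, 3]})

# drop_first=True omits the dummy for the first level (Education = 1).
encoded = pd.get_dummies(toy, columns=["Education"], drop_first=True).astype(float)
print(encoded)
```

This is why the feature list later contains Education_2 and Education_3 but no Education_1, and ZIPCode_91 through ZIPCode_96 but no ZIPCode_90.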
# Splitting data in train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=1, stratify=Y
)
print("Shape of Training set : ", X_train.shape)
print("Shape of test set : ", X_test.shape)
print("Percentage of classes in training set:")
print(y_train.value_counts(normalize=True))
print("Percentage of classes in test set:")
print(y_test.value_counts(normalize=True))
Shape of Training set :  (3500, 17)
Shape of test set :  (1500, 17)
Percentage of classes in training set:
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
0    0.904
1    0.096
Name: Personal_Loan, dtype: float64
7 Model Building¶
7.1 Model Evaluation Criterion¶
The objective is to build a model that will help the marketing department identify potential customers with a higher probability of purchasing a loan.
When building a classification model, two types of errors can occur:
- False Positives (FP): The model predicts a customer will purchase a loan (1), but they do not (0). This means the bank will spend marketing resources on a customer who is not interested.
- False Negatives (FN): The model predicts a customer will not purchase a loan (0), but they would have (1). This means the bank misses an opportunity to acquire a loan customer.
Which error is more important to minimize?
In this scenario, minimizing False Negatives (FN) is more critical. Missing a potential loan customer means lost revenue for the bank. While False Positives result in wasted marketing effort, the cost of sending an email is relatively low compared to the potential profit from a loan.
How can we minimize False Negatives?
To minimize False Negatives, we should prioritize the Recall metric. Recall measures the proportion of actual positive cases (customers who would have purchased a loan) that were correctly identified by the model. A higher Recall means fewer potential loan customers are missed.
$\text{Recall} = \frac{\text{TP}}{\text{TP + FN}}$
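As a small worked example of the formula, the sketch below builds a confusion matrix for ten made-up predictions and computes recall both by hand and with `recall_score`. The labels are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up labels: 4 actual loan customers (1) and 6 non-customers (0).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

recall_manual = tp / (tp + fn)  # share of actual positives that were found
precision_manual = tp / (tp + fp)  # share of predicted positives that were right

print(recall_manual, recall_score(y_true, y_pred))
```

One missed customer out of four actual customers gives a recall of 3 / 4.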
First, let's create functions to calculate different metrics and confusion matrix so that we don't have to use the same code repeatedly for each model.
- The model_performance_classification_sklearn function will be used to check model performance.
- The confusion_matrix_sklearn function will be used to plot the confusion matrix.
# Define a utility function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1,},
        index=[0],
    )
    return df_perf
# Define a utility function to plot confusion matrix
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    # each cell shows the raw count and its share of all samples
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
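To see what the heatmap annotations look like without plotting, the sketch below builds the same count-plus-percentage labels for a fixed 2x2 matrix. The counts are hypothetical, and `annotation_labels` is an illustrative helper rather than part of the notebook; the layout follows scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
def annotation_labels(cm):
    """Build a count line plus a percent-of-total line for each matrix cell."""
    total = sum(sum(row) for row in cm)
    return [
        ["{0:0.0f}".format(cell) + "\n{0:.2%}".format(cell / total) for cell in row]
        for row in cm
    ]

# Hypothetical counts laid out as [[TN, FP], [FN, TP]].
cm_example = [[870, 30], [10, 90]]
labels = annotation_labels(cm_example)
print(labels[1][1])
```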
7.2 Decision Tree (sklearn default)¶
# Create an instance of the decision tree model
model_default = DecisionTreeClassifier(criterion="gini", random_state=1)
# Fit the model to the training data
model_default.fit(X_train, y_train)
DecisionTreeClassifier(random_state=1)
7.2.1 Check model performance for default model on training data¶
confusion_matrix_sklearn(model_default, X_train, y_train)
decision_tree_perf_train = model_performance_classification_sklearn(
model_default, X_train, y_train
)
decision_tree_perf_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 1.0 | 1.0 | 1.0 | 1.0 |
7.2.2 Visualizing the Decision Tree for default model on training data¶
feature_names = list(X_train.columns)
print(feature_names)
['Age', 'Income', 'Family', 'CCAvg', 'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'ZIPCode_91', 'ZIPCode_92', 'ZIPCode_93', 'ZIPCode_94', 'ZIPCode_95', 'ZIPCode_96', 'Education_2', 'Education_3']
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model_default,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# Add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(
tree.export_text(
model_default, # specify the model
feature_names=feature_names, # specify the feature names
show_weights=True # specify whether or not to show the weights associated with the model
)
)
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2519.00, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 26.50
|   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |   |--- weights: [61.00, 0.00] class: 0
|   |   |   |   |   |--- Income > 81.50
|   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |--- Age <= 30.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age > 30.00
|   |   |   |   |   |   |   |   |--- Age <= 45.00
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age > 45.00
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg > 3.05
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CCAvg > 3.70
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [25.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- CCAvg > 3.55
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- CD_Account > 0.50
|   |   |   |   |--- Age <= 40.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Age > 40.50
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |--- Income > 92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |--- Age <= 61.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.35
|   |   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 4.35
|   |   |   |   |   |   |   |   |--- Online <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Online > 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age > 61.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Education_2 > 0.50
|   |   |   |   |   |   |--- Age <= 61.00
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |--- Age > 61.00
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- CCAvg <= 3.85
|   |   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age > 36.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.85
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- Family > 2.50
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- CCAvg > 4.45
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- Securities_Account <= 0.50
|   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |--- Securities_Account > 0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age > 57.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|--- Income > 104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age > 28.50
|   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.85
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |--- Income > 116.50
|   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age > 33.00
|   |   |   |   |   |   |--- CCAvg <= 3.27
|   |   |   |   |   |   |   |--- Age <= 50.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Age > 50.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 3.27
|   |   |   |   |   |   |   |--- Age <= 50.00
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age > 50.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 67.00] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 114.50
|   |   |   |--- Age <= 28.50
|   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |--- Age > 28.50
|   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- CCAvg <= 2.90
|   |   |   |   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_3 > 0.50
|   |   |   |   |   |   |   |   |--- CCAvg <= 1.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- CCAvg > 1.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg > 2.90
|   |   |   |   |   |   |   |--- CCAvg <= 4.30
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- CCAvg > 4.30
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Family > 3.50
|   |   |   |   |   |   |--- Age <= 34.00
|   |   |   |   |   |   |   |--- CCAvg <= 2.15
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg > 2.15
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age > 34.00
|   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |--- Age > 60.00
|   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |--- Income > 114.50
|   |   |   |--- weights: [0.00, 155.00] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
model_default.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.357039
Family              0.207239
Education_2         0.163788
Education_3         0.146424
CCAvg               0.059631
Age                 0.052700
CD_Account          0.005728
Online              0.004393
Securities_Account  0.003057
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Mortgage            0.000000
CreditCard          0.000000
importances = model_default.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
7.2.3 Check model performance for default model on test data¶
confusion_matrix_sklearn(model_default, X_test, y_test)
decision_tree_perf_test = model_performance_classification_sklearn(model_default, X_test, y_test)
decision_tree_perf_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.981333 | 0.861111 | 0.939394 | 0.898551 |
print(decision_tree_perf_train)
print("\n")
print(decision_tree_perf_test)
   Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0


   Accuracy    Recall  Precision        F1
0  0.981333  0.861111   0.939394  0.898551
- The default decision tree model overfits. It achieves perfect scores on the training data, but recall drops to 0.86 on the unseen test data.
7.3 Pre-pruning - performance improvement¶
# Define the parameters of the tree to iterate over
max_depth_values = np.arange(2, 11, 2)
max_leaf_nodes_values = [50, 75, 150, 250]
min_samples_split_values = [10, 30, 50, 70]
# Initialize variables to store the best model and its performance
best_estimator = None
best_score_diff = float('inf')
best_test_score = 0.0
# Iterate over all combinations of the specified parameter values
for max_depth in max_depth_values:
    for max_leaf_nodes in max_leaf_nodes_values:
        for min_samples_split in min_samples_split_values:
            # Initialize the tree with the current set of parameters
            estimator = DecisionTreeClassifier(
                max_depth=max_depth,
                max_leaf_nodes=max_leaf_nodes,
                min_samples_split=min_samples_split,
                class_weight='balanced',
                random_state=42
            )
            # Fit the model to the training data
            estimator.fit(X_train, y_train)
            # Make predictions on the training and test sets
            y_train_pred = estimator.predict(X_train)
            y_test_pred = estimator.predict(X_test)
            # Calculate recall scores for training and test sets
            train_recall_score = recall_score(y_train, y_train_pred)
            test_recall_score = recall_score(y_test, y_test_pred)
            # Calculate the absolute difference between training and test recall scores
            score_diff = abs(train_recall_score - test_recall_score)
            # Update the best estimator only if the current one has both a smaller
            # score difference and a higher test recall
            if score_diff < best_score_diff and test_recall_score > best_test_score:
                best_score_diff = score_diff
                best_test_score = test_recall_score
                best_estimator = estimator
# Print the best parameters
print("Best parameters found:")
print(f"Max depth: {best_estimator.max_depth}")
print(f"Max leaf nodes: {best_estimator.max_leaf_nodes}")
print(f"Min samples split: {best_estimator.min_samples_split}")
print(f"Best test recall score: {best_test_score}")
Best parameters found:
Max depth: 2
Max leaf nodes: 50
Min samples split: 10
Best test recall score: 1.0
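Note that the winning combination is the first one the loops try. With the update rule above, a candidate must beat the incumbent on both the recall gap and the test recall at once, so an early candidate with a zero gap and a test recall of 1.0 can never be replaced. The sketch below replays that rule on hypothetical (gap, test recall) pairs to show how the result can depend on iteration order; `select_greedy` and the candidate values are illustrative assumptions, not results from the notebook.

```python
def select_greedy(candidates):
    """Replay the notebook's update rule on (name, gap, test_recall) tuples."""
    best_name, best_gap, best_recall = None, float("inf"), 0.0
    for name, gap, recall in candidates:
        # A candidate is accepted only if it improves BOTH criteria at once.
        if gap < best_gap and recall > best_recall:
            best_name, best_gap, best_recall = name, gap, recall
    return best_name

# Hypothetical candidates: B has the highest test recall, C the smallest gap.
candidates = [("A", 0.02, 0.90), ("B", 0.03, 0.97), ("C", 0.01, 0.88)]
print(select_greedy(candidates))                   # visiting order A, B, C
print(select_greedy(list(reversed(candidates))))   # visiting order C, B, A
```

The two calls return different winners for the same set of candidates, which suggests ranking all combinations by one explicit criterion instead, ideally computed on a validation split rather than on the test set.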
# Fit the best algorithm to the data.
estimator = best_estimator
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight='balanced', max_depth=2, max_leaf_nodes=50,
                       min_samples_split=10, random_state=42)
7.3.1 Check performance on training data¶
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_pre_tune_train = model_performance_classification_sklearn(estimator, X_train, y_train)
decision_tree_pre_tune_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.8 | 1.0 | 0.324324 | 0.489796 |
7.3.2 Visualize the Decision Tree¶
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1362.83, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- weights: [70.80, 93.75] class: 1
|--- Income > 98.50
|   |--- Family <= 2.50
|   |   |--- weights: [285.95, 739.58] class: 1
|   |--- Family > 2.50
|   |   |--- weights: [30.42, 916.67] class: 1
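The fractional leaf weights come from `class_weight='balanced'`, which scikit-learn documents as `n_samples / (n_classes * np.bincount(y))`. The sketch below recomputes those weights and checks them against the printed leaves. It assumes 3,164 non-buyers and 336 buyers in the training set; these counts are inferred from the leaf weights and the reported metrics, not stated in the notebook.

```python
n_neg, n_pos = 3164, 336          # assumed training class counts (inferred)
n_samples, n_classes = n_neg + n_pos, 2

# Balanced weighting: n_samples / (n_classes * class_count)
w_neg = n_samples / (n_classes * n_neg)
w_pos = n_samples / (n_classes * n_pos)
print(round(w_neg, 4), round(w_pos, 4))

# Each class then carries the same total weight (half of n_samples).
print(w_neg * n_neg, w_pos * n_pos)

# Leaf weights printed for the pre-pruned tree, as (class 0, class 1).
leaves = [(1362.83, 0.00), (70.80, 93.75), (285.95, 739.58), (30.42, 916.67)]
total_neg = sum(neg for neg, _ in leaves)
total_pos = sum(pos for _, pos in leaves)
print(round(total_neg, 2), round(total_pos, 2))
```

Under these assumed counts, both classes sum to about 1,750 in the printed tree, so a single buyer counts roughly as much as nine non-buyers when the splits are chosen.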
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.870403
CCAvg               0.079565
Family              0.050032
Age                 0.000000
ZIPCode_92          0.000000
Education_2         0.000000
ZIPCode_96          0.000000
ZIPCode_95          0.000000
ZIPCode_94          0.000000
ZIPCode_93          0.000000
CreditCard          0.000000
ZIPCode_91          0.000000
Online              0.000000
CD_Account          0.000000
Securities_Account  0.000000
Mortgage            0.000000
Education_3         0.000000
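Because this tree has only three splits, the Gini importances can be recomputed by hand from the leaf weights printed in the text report. The sketch below does that: each split's importance is its weighted impurity decrease, normalized so that the three values sum to 1. The helper names are illustrative, and small differences from the printed values are expected because the leaf weights are rounded to two decimals.

```python
def gini(neg, pos):
    """Gini impurity of a node holding weighted class totals (neg, pos)."""
    total = neg + pos
    return 1.0 - (neg / total) ** 2 - (pos / total) ** 2

def add(a, b):
    """Sum two (neg, pos) weight pairs."""
    return (a[0] + b[0], a[1] + b[1])

def weighted_gini(node, n_total):
    """Gini impurity of a node scaled by its share of the total weight."""
    return sum(node) / n_total * gini(*node)

def decrease(parent, left, right, n_total):
    """Weighted impurity decrease from splitting parent into left and right."""
    return (
        weighted_gini(parent, n_total)
        - weighted_gini(left, n_total)
        - weighted_gini(right, n_total)
    )

# Leaf weights printed for the pre-pruned tree, as (class 0, class 1).
low_cc, high_cc = (1362.83, 0.00), (70.80, 93.75)         # Income <= 98.50 branch
small_fam, large_fam = (285.95, 739.58), (30.42, 916.67)  # Income > 98.50 branch

income_le, income_gt = add(low_cc, high_cc), add(small_fam, large_fam)
root = add(income_le, income_gt)
n_total = sum(root)

raw = {
    "Income": decrease(root, income_le, income_gt, n_total),
    "CCAvg": decrease(income_le, low_cc, high_cc, n_total),
    "Family": decrease(income_gt, small_fam, large_fam, n_total),
}
norm = sum(raw.values())
importances = {name: value / norm for name, value in raw.items()}
print(importances)
```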
importances = estimator.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
7.3.3 Checking performance on test data¶
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_pre_tune_test = model_performance_classification_sklearn(estimator, X_test, y_test)
decision_tree_pre_tune_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.816 | 1.0 | 0.342857 | 0.510638 |
# Let's just print the metrics for train and test for pre-pruned model
quick_comp = pd.concat(
[decision_tree_pre_tune_train, decision_tree_pre_tune_test], axis=0,
)
quick_comp.index = ["Training set (Pre-Pruned)", "Test set (Pre-Pruned)"]
print("Pre-pruned model: training vs. test performance:")
quick_comp
Pre-pruned model: training vs. test performance:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Training set (Pre-Pruned) | 0.800 | 1.0 | 0.324324 | 0.489796 |
| Test set (Pre-Pruned) | 0.816 | 1.0 | 0.342857 | 0.510638 |
# Let's just print the metrics derived so far on train and test for default and pre-pruned model
# for quick comparison
quick_comp = pd.concat(
[decision_tree_perf_train,decision_tree_perf_test, decision_tree_pre_tune_train, decision_tree_pre_tune_test], axis=0,
)
quick_comp.index = ["Training set (Default)", "Test set (Default)", "Training set (Pre-Pruned)", "Test set (Pre-Pruned)"]
print("Performance comparison so far:")
quick_comp
Performance comparison so far:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Training set (Default) | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Test set (Default) | 0.981333 | 0.861111 | 0.939394 | 0.898551 |
| Training set (Pre-Pruned) | 0.800000 | 1.000000 | 0.324324 | 0.489796 |
| Test set (Pre-Pruned) | 0.816000 | 1.000000 | 0.342857 | 0.510638 |
- The pre-pruned model achieved a recall of 1 with the parameters max_depth=2, max_leaf_nodes=50, and min_samples_split=10.
- However, the precision and F1 scores are low for the pre-pruned model (about 0.34 and 0.51 on the test set).
- The pre-pruned model identified Income, CCAvg, and Family as the important features for predicting the probability of a customer purchasing a loan.
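Because the search loop in this section picks hyperparameters by looking at test-set recall, the reported test scores are likely optimistic. One common alternative, swapped in here and not the notebook's method, is a cross-validated grid search that uses the training split only. The sketch below runs on synthetic data from `make_classification` as a stand-in for the bank data; the data, the names `X_tr`/`X_te`/`y_tr`/`y_te`, and the reduced grid are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 10% positives); not the bank data.
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

param_grid = {
    "max_depth": [2, 4, 6],
    "max_leaf_nodes": [50, 75],
    "min_samples_split": [10, 30],
}
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=42),
    param_grid,
    scoring="recall",   # same metric the notebook optimizes
    cv=5,               # 5-fold stratified CV on the training split only
)
search.fit(X_tr, y_tr)

print(search.best_params_)
print(search.best_score_)          # mean cross-validated recall
print(search.score(X_te, y_te))    # recall on the untouched test split
```

With this setup the test split is scored once, after selection, so it remains an honest estimate of performance on unseen customers.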
7.4 Post-pruning - performance improvement¶
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.000000 | 0.000000 |
| 1 | 0.000250 | 0.000500 |
| 2 | 0.000257 | 0.001014 |
| 3 | 0.000276 | 0.001566 |
| 4 | 0.000286 | 0.002137 |
| 5 | 0.000343 | 0.002480 |
| 6 | 0.000400 | 0.003680 |
| 7 | 0.000429 | 0.004109 |
| 8 | 0.000429 | 0.004537 |
| 9 | 0.000457 | 0.004995 |
| 10 | 0.000467 | 0.005461 |
| 11 | 0.000470 | 0.009222 |
| 12 | 0.000484 | 0.010189 |
| 13 | 0.000488 | 0.010677 |
| 14 | 0.000495 | 0.011667 |
| 15 | 0.000508 | 0.012175 |
| 16 | 0.000583 | 0.012758 |
| 17 | 0.000595 | 0.013354 |
| 18 | 0.000667 | 0.016023 |
| 19 | 0.000938 | 0.016961 |
| 20 | 0.000989 | 0.017950 |
| 21 | 0.000994 | 0.018944 |
| 22 | 0.001076 | 0.021097 |
| 23 | 0.001625 | 0.022723 |
| 24 | 0.001782 | 0.024505 |
| 25 | 0.001908 | 0.026413 |
| 26 | 0.002335 | 0.028748 |
| 27 | 0.002970 | 0.031718 |
| 28 | 0.008156 | 0.039874 |
| 29 | 0.025722 | 0.091318 |
| 30 | 0.034690 | 0.126007 |
| 31 | 0.047561 | 0.173568 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
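Each effective alpha is the price, per removed leaf, of collapsing a subtree into a single leaf: for an internal node t with subtree T_t, alpha_eff(t) = (R(t) - R(T_t)) / (|T_t| - 1), where R is the sample-weighted impurity and |T_t| is the number of leaves in the subtree. The sketch below evaluates that formula on hypothetical impurity values (none of these numbers come from the notebook) and picks the weakest link, which is the node pruned first.

```python
def effective_alpha(r_node, r_subtree, n_leaves):
    """Cost-complexity price of collapsing a subtree into a single leaf.

    r_node:    weighted impurity of the node if it were turned into a leaf
    r_subtree: summed weighted impurity of the leaves currently below it
    n_leaves:  number of leaves in the subtree
    """
    return (r_node - r_subtree) / (n_leaves - 1)

# Hypothetical internal nodes: (impurity as a leaf, subtree impurity, leaves)
nodes = {
    "t1": (0.0200, 0.0100, 3),
    "t2": (0.0500, 0.0050, 4),
    "t3": (0.0040, 0.0035, 2),
}
alphas = {name: effective_alpha(*vals) for name, vals in nodes.items()}
weakest = min(alphas, key=alphas.get)
print(alphas, weakest)
```

Pruning repeatedly removes the weakest link and recomputes the alphas, which yields the increasing sequence of alphas and impurities shown in the table.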
Next, we train a decision tree using effective alphas. The last value
in ccp_alphas is the alpha value that prunes the whole tree,
leaving the tree, clfs[-1], with one node.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04756053380018527
For the remainder, we remove the last element in
clfs and ccp_alphas, because it is the trivial tree with only one
node. Here we show that the number of nodes and the tree depth decrease as alpha
increases.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
Recall vs alpha for training and testing sets
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)
recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
index_best_model = np.argmax(recall_test)
best_model = clfs[index_best_model]
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0009377289377289376, random_state=1)
best_alpha = best_model.ccp_alpha
print(best_alpha)
0.0009377289377289376
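`np.argmax` returns the first index at which the maximum occurs. Since `ccp_alphas` is sorted in increasing order, a tie on test recall is resolved in favor of the smallest tied alpha, that is, the largest tree. The sketch below shows one possible tie-break that prefers the largest tied alpha instead; the alpha and recall lists are hypothetical, and `pick_largest_alpha_at_best` is an illustrative helper.

```python
def pick_largest_alpha_at_best(alphas, recalls, tol=1e-12):
    """Return the index of the largest alpha whose recall ties the best recall."""
    best = max(recalls)
    tied = [i for i, r in enumerate(recalls) if best - r <= tol]
    return max(tied, key=lambda i: alphas[i])

alphas_demo = [0.0000, 0.0005, 0.0009, 0.0016, 0.0030]   # hypothetical, ascending
recalls_demo = [0.86, 0.90, 0.92, 0.92, 0.88]            # hypothetical test recalls

first_max = recalls_demo.index(max(recalls_demo))   # same choice np.argmax would make
simplest = pick_largest_alpha_at_best(alphas_demo, recalls_demo)
print(first_max, simplest)
```

Preferring the larger alpha keeps recall unchanged on the evaluated data while yielding a smaller tree, which is usually easier to explain to a marketing team.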
estimator_2 = DecisionTreeClassifier(
#ccp_alpha=best_alpha, class_weight={0: 0.15, 1: 0.85}, random_state=1
ccp_alpha=best_alpha, class_weight="balanced", random_state=1
)
estimator_2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.0009377289377289376, class_weight='balanced',
                       random_state=1)
7.4.1 Checking performance on training data¶
confusion_matrix_sklearn(estimator_2, X_train, y_train)
decision_tree_post_tune_train = model_performance_classification_sklearn(estimator_2, X_train, y_train)
decision_tree_post_tune_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.982286 | 1.0 | 0.844221 | 0.915531 |
7.4.2 Visualizing the Decision Tree¶
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator_2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [1362.83, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- Income <= 81.50
|   |   |   |--- Age <= 36.50
|   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |--- weights: [1.11, 15.62] class: 1
|   |   |   |   |--- Family > 3.50
|   |   |   |   |   |--- weights: [6.08, 0.00] class: 0
|   |   |   |--- Age > 36.50
|   |   |   |   |--- weights: [33.74, 0.00] class: 0
|   |   |--- Income > 81.50
|   |   |   |--- CCAvg <= 4.40
|   |   |   |   |--- Age <= 46.00
|   |   |   |   |   |--- Income <= 90.50
|   |   |   |   |   |   |--- weights: [7.74, 0.00] class: 0
|   |   |   |   |   |--- Income > 90.50
|   |   |   |   |   |   |--- weights: [2.21, 10.42] class: 1
|   |   |   |   |--- Age > 46.00
|   |   |   |   |   |--- weights: [11.06, 67.71] class: 1
|   |   |   |--- CCAvg > 4.40
|   |   |   |   |--- weights: [8.85, 0.00] class: 0
|--- Income > 98.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- Income <= 101.50
|   |   |   |   |   |--- CCAvg <= 2.95
|   |   |   |   |   |   |--- weights: [2.77, 0.00] class: 0
|   |   |   |   |   |--- CCAvg > 2.95
|   |   |   |   |   |   |--- weights: [0.55, 15.62] class: 1
|   |   |   |   |--- Income > 101.50
|   |   |   |   |   |--- weights: [263.27, 0.00] class: 0
|   |   |   |--- Education_2 > 0.50
|   |   |   |   |--- Income <= 103.50
|   |   |   |   |   |--- weights: [4.42, 0.00] class: 0
|   |   |   |   |--- Income > 103.50
|   |   |   |   |   |--- weights: [4.98, 322.92] class: 1
|   |   |--- Education_3 > 0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [3.87, 0.00] class: 0
|   |   |   |   |--- CCAvg > 1.10
|   |   |   |   |   |--- weights: [6.08, 52.08] class: 1
|   |   |   |--- Income > 116.50
|   |   |   |   |--- weights: [0.00, 348.96] class: 1
|   |--- Family > 2.50
|   |   |--- Income <= 112.50
|   |   |   |--- CCAvg <= 2.75
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [14.93, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- Income <= 111.50
|   |   |   |   |   |   |--- weights: [3.87, 20.83] class: 1
|   |   |   |   |   |--- Income > 111.50
|   |   |   |   |   |   |--- weights: [4.42, 0.00] class: 0
|   |   |   |--- CCAvg > 2.75
|   |   |   |   |--- Age <= 59.50
|   |   |   |   |   |--- weights: [1.11, 52.08] class: 1
|   |   |   |   |--- Age > 59.50
|   |   |   |   |   |--- weights: [2.77, 0.00] class: 0
|   |   |--- Income > 112.50
|   |   |   |--- weights: [3.32, 843.75] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
estimator_2.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income              0.667899
Education_2         0.151815
CCAvg               0.074623
Education_3         0.052674
Family              0.040114
Age                 0.012875
CD_Account          0.000000
Online              0.000000
Securities_Account  0.000000
ZIPCode_91          0.000000
ZIPCode_92          0.000000
ZIPCode_93          0.000000
ZIPCode_94          0.000000
ZIPCode_95          0.000000
ZIPCode_96          0.000000
Mortgage            0.000000
CreditCard          0.000000
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(8, 8))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
7.4.3 Checking performance on test data¶
confusion_matrix_sklearn(estimator_2, X_test, y_test)
decision_tree_post_tune_test = model_performance_classification_sklearn(estimator_2, X_test, y_test)
decision_tree_post_tune_test
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.971333 | 0.944444 | 0.795322 | 0.863492 |
- The recall on the test data for the post-pruned model is 0.94, which is lower than the perfect recall of 1.0 observed on the training set.
- However, test precision for the post-pruned model (0.80) is much higher than for the pre-pruned model (0.34).
- Although the primary focus is on the recall metric, the precision and F1 scores are also important to consider for a comprehensive evaluation.
8 Model Performance Comparison and Final Model Selection¶
# training performance comparison
models_train_comp_df = pd.concat(
[decision_tree_perf_train, decision_tree_pre_tune_train, decision_tree_post_tune_train], axis=0,
)
models_train_comp_df.index = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning)", "Decision Tree (Post-Pruning)"]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Decision Tree (sklearn default) | 1.000000 | 1.0 | 1.000000 | 1.000000 |
| Decision Tree (Pre-Pruning) | 0.800000 | 1.0 | 0.324324 | 0.489796 |
| Decision Tree (Post-Pruning) | 0.982286 | 1.0 | 0.844221 | 0.915531 |
# testing performance comparison
models_test_comp_df = pd.concat(
[decision_tree_perf_test, decision_tree_pre_tune_test, decision_tree_post_tune_test], axis=0,
)
models_test_comp_df.index = ["Decision Tree (sklearn default)", "Decision Tree (Pre-Pruning)", "Decision Tree (Post-Pruning)"]
print("Test set performance comparison:")
models_test_comp_df
Test set performance comparison:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Decision Tree (sklearn default) | 0.981333 | 0.861111 | 0.939394 | 0.898551 |
| Decision Tree (Pre-Pruning) | 0.816000 | 1.000000 | 0.342857 | 0.510638 |
| Decision Tree (Post-Pruning) | 0.971333 | 0.944444 | 0.795322 | 0.863492 |
# Let's compare all the metrics for default, pre-pruned and post-pruned models on train and test data
models_metrics_comp = pd.concat(
[
decision_tree_perf_train,
decision_tree_perf_test,
decision_tree_pre_tune_train,
decision_tree_pre_tune_test,
decision_tree_post_tune_train,
decision_tree_post_tune_test
], axis=0,
)
models_metrics_comp.index = [
"Training set (Default)",
"Test set (Default)",
"Training set (Pre-Pruned)",
"Test set (Pre-Pruned)",
"Training set (Post-Pruned)",
"Test set (Post-Pruned)"]
print("Performance comparison of default, pre-pruned and post-pruned tree models in one frame:")
models_metrics_comp
Performance comparison of default, pre-pruned and post-pruned tree models in one frame:
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Training set (Default) | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
| Test set (Default) | 0.981333 | 0.861111 | 0.939394 | 0.898551 |
| Training set (Pre-Pruned) | 0.800000 | 1.000000 | 0.324324 | 0.489796 |
| Test set (Pre-Pruned) | 0.816000 | 1.000000 | 0.342857 | 0.510638 |
| Training set (Post-Pruned) | 0.982286 | 1.000000 | 0.844221 | 0.915531 |
| Test set (Post-Pruned) | 0.971333 | 0.944444 | 0.795322 | 0.863492 |
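To put numbers on how well each model generalizes, the sketch below subtracts the test score from the training score for each metric, using the values from the table above. Nothing is re-estimated here; the `reported` dictionary simply restates the table, and a gap near zero means the model behaves similarly on seen and unseen data.

```python
# Metrics copied from the comparison table: (train, test) per model and metric.
reported = {
    "Default": {
        "Recall": (1.000000, 0.861111),
        "Precision": (1.000000, 0.939394),
        "F1": (1.000000, 0.898551),
    },
    "Pre-Pruned": {
        "Recall": (1.000000, 1.000000),
        "Precision": (0.324324, 0.342857),
        "F1": (0.489796, 0.510638),
    },
    "Post-Pruned": {
        "Recall": (1.000000, 0.944444),
        "Precision": (0.844221, 0.795322),
        "F1": (0.915531, 0.863492),
    },
}

# Positive gap: the model scores higher on training data than on test data.
gaps = {
    model: {metric: round(train - test, 6) for metric, (train, test) in metrics.items()}
    for model, metrics in reported.items()
}
for model, model_gaps in gaps.items():
    print(model, model_gaps)
```

The default tree loses about 0.14 recall on the test set, the post-pruned tree about 0.06, and the pre-pruned tree shows no recall gap but starts from a much lower precision.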
8.1 Compare feature importance side by side¶
fig, axes = plt.subplots(1, 3, figsize=(24, 8))
# plot for feature_importance of default model
importances = model_default.feature_importances_
indices = np.argsort(importances)
axes[0].set_title("Feature Importances (Default Model)")
axes[0].barh(range(len(indices)), importances[indices], color="violet", align="center")
axes[0].set_yticks(range(len(indices)))
axes[0].set_yticklabels([feature_names[i] for i in indices])
axes[0].set_xlabel("Relative Importance")
# plot for feature_importance of pre-pruned model
importances = estimator.feature_importances_
indices = np.argsort(importances)
axes[1].set_title("Feature Importances (Pre-Pruned Model)")
axes[1].barh(range(len(indices)), importances[indices], color="violet", align="center")
axes[1].set_yticks(range(len(indices)))
axes[1].set_yticklabels([feature_names[i] for i in indices])
axes[1].set_xlabel("Relative Importance")
# plot for feature_importance of post-pruned model
importances = estimator_2.feature_importances_
indices = np.argsort(importances)
axes[2].set_title("Feature Importances (Post-Pruned Model)")
axes[2].barh(range(len(indices)), importances[indices], color="violet", align="center")
axes[2].set_yticks(range(len(indices)))
axes[2].set_yticklabels([feature_names[i] for i in indices])
axes[2].set_xlabel("Relative Importance")
plt.tight_layout()
plt.show()
8.2 Visualize trees side by side¶
fig, axes = plt.subplots(1, 3, figsize=(30, 15)) # Create a figure with 1 row and 3 columns
# Visualize decision tree for default model in the first subplot
out = tree.plot_tree(
model_default,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
ax=axes[0] # Specify the subplot to plot on
)
axes[0].set_title("Decision Tree (Default Model)") # Set title for the subplot
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
# Visualize decision tree for pre-pruned model in the second subplot
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
ax=axes[1] # Specify the subplot to plot on
)
axes[1].set_title("Decision Tree (Pre-Pruned Model)") # Set title for the subplot
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
# Visualize decision tree for post-pruned model in the third subplot
out = tree.plot_tree(
estimator_2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
ax=axes[2] # Specify the subplot to plot on
)
axes[2].set_title("Decision Tree (Post-Pruned Model)") # Set title for the subplot
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.tight_layout() # Adjust layout to prevent overlapping titles/labels
plt.show()
9 Actionable Insights and Business Recommendations¶
9.1 Observation derived from performance metrics¶
- The post-pruned decision tree exhibits good generalization, performing similarly on both training and test data in terms of recall, precision, and F1-score.
- The pre-pruned tree achieved a perfect recall of 1 on the test set, but its other metrics (precision and F1) were poor, indicating it might be flagging too many non-loan customers as potential loan buyers (high false positives).
- The pre-pruned tree only considered Income, CCAvg, and Family as important features. In contrast, the post-pruned tree is more comprehensive, including Income, Education_2, CCAvg, Education_3, Family, and Age, which is likely to lead to more robust predictions.
- We will select the post-pruned model as the best for this problem for the following reasons:
- It achieves a good recall score of 0.94 on the test set, which is crucial for minimizing false negatives (missing potential loan customers). It also maintains reasonable precision and F1 scores, unlike the pre-pruned tree.
- Although depth is not a direct performance metric, the post-pruned tree is deeper than the depth-2 pre-pruned tree and splits on more features, which resulted in better overall performance, particularly the improved test precision (0.80 versus 0.34).
9.2 Business recommendations to the bank¶
X_test.iloc[:1, :]
| | Age | Income | Family | CCAvg | Mortgage | Securities_Account | CD_Account | Online | CreditCard | ZIPCode_91 | ZIPCode_92 | ZIPCode_93 | ZIPCode_94 | ZIPCode_95 | ZIPCode_96 | Education_2 | Education_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 805 | 55.0 | 132.0 | 3.0 | 5.9 | 307.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
%%time
# choosing a data point
test_customer = X_test.iloc[:1, :]
# making a prediction
loan_potential = estimator_2.predict(test_customer)
print(loan_potential)
[1]
CPU times: user 3.27 ms, sys: 14 µs, total: 3.29 ms
Wall time: 3.13 ms
# making a prediction
loan_purchase_likelihood = estimator_2.predict_proba(test_customer)
print(loan_purchase_likelihood[0, 1])
0.9960822722820765
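For a decision tree, `predict_proba` returns the class fractions of the leaf the sample lands in; with class weights, those fractions are based on the weighted totals. The sketch below follows the printed rules of the post-pruned tree for this customer (Income 132, Family 3) and recomputes the score from that leaf's printed weights. `leaf_weights_for` is a hypothetical helper that encodes only this one branch of the tree.

```python
def leaf_weights_for(income, family):
    """Follow one branch of the printed post-pruned rules.

    Only the path Income > 98.50 -> Family > 2.50 -> Income > 112.50 is
    encoded; any other customer returns None in this sketch.
    """
    if income > 98.50 and family > 2.50 and income > 112.50:
        return (3.32, 843.75)   # printed leaf weights: [class 0, class 1]
    return None

w_no, w_yes = leaf_weights_for(income=132.0, family=3.0)
p_loan = w_yes / (w_no + w_yes)
print(p_loan)
```

The result agrees with the printed probability to about five decimal places; the remaining difference comes from the two-decimal rounding of the leaf weights in the text report.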
- The predicted probability that test_customer will purchase a loan is about 0.996, i.e. the model is ~99% confident.
- The bank's marketing team can deploy this model to identify which liability customers have a higher potential to purchase a loan.
- Using the likelihood score, the bank can tailor their marketing targets.
- Income and education are the most important contributors to the model's decisions.
- Credit card spending habits and family size attributes also play a role.
- The bank can tailor its marketing strategies towards customers who have a higher income, are more educated, spend more on their credit cards, and have a larger family.
- The model is built to reduce false negatives, so the marketing team misses few potential customers (test recall of 0.94). At the same time, it keeps a reasonable precision (about 0.80 on the test set), which limits false positives.
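One way the likelihood score could drive targeting is to rank customers by predicted probability and contact only those above a chosen cutoff, optionally capped by campaign budget. The sketch below illustrates this with hypothetical customer IDs and scores; `select_targets`, the 0.5 threshold, and the `top_k` cap are assumptions for illustration, not recommendations derived from the notebook.

```python
def select_targets(scores, threshold=0.5, top_k=None):
    """Rank customers by predicted loan probability and keep those above a threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [(cid, p) for cid, p in ranked if p >= threshold]
    return kept if top_k is None else kept[:top_k]

# Hypothetical customer IDs and predicted probabilities.
scores_demo = {"C001": 0.996, "C002": 0.12, "C003": 0.77, "C004": 0.55, "C005": 0.03}
print(select_targets(scores_demo, threshold=0.5))
print(select_targets(scores_demo, threshold=0.5, top_k=2))
```

Lowering the threshold trades precision for recall, so the cutoff can be set according to how cheap each contact is relative to the value of a converted loan.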